ABSTRACT

We consider the recently introduced monochromatic reverse
top-k query, which asks, given a new tuple q and a
dataset D, for all possible top-k queries on $D \cup \{q\}$ for
which q is in the result. For this problem, we focus on
designing indexes in two dimensions for repeated (or batch)
querying, a novel but practical consideration. We present the
novel insight that, by representing the dataset as an
arrangement of lines, a critical k-polygon can be identified
and used exclusively to respond to reverse top-k queries. We
construct an index based on this observation which has
guaranteed worst-case logarithmic query cost.

We implement our work and compare it to related ap- 
proaches, demonstrating that our index is fast in practice. 
Furthermore, we demonstrate through our experiments that
a k-polygon comprises a small proportion of the original
data, so our index structure consumes little disk space.

Categories and Subject Descriptors 

H.3.1 [Information Systems Applications]: Content Analysis
and Indexing - indexing methods; F.2.2 [Analysis of
Algorithms and Problem Complexity]: Nonnumerical
Algorithms and Problems - geometrical problems and
computations

Keywords 

Reverse top-k, top-k depth, arrangements of lines, access
methods

1. INTRODUCTION

Imagine a software engineering team in the early stages of 
developing a new single-player console game. Given aggres- 
sive timelines until product launch, they need to prioritise 
the development efforts. Market research reveals that games 
in this category are typically assessed by end-users in terms 
of the quality of the graphics and the intelligence of the AI. 
Furthermore, different users express different trade-offs in 
terms of which of these two metrics they believe to be the
most important. 

If most games in this category have focused on the de- 
velopment of graphics, then the graphics-focused market is 
perhaps saturated and the engineering team may encounter 
more success by focusing instead on the development of the 
AI. More broadly stated, the exposure, and indeed success,
of a product depends largely on how well it ranks against
other, similar products. The task with which the engineering
team is faced here, in fact, is to assess how to maximise
"top-k exposure", the breadth of top-k queries for which their
product is returned.

Computing the top-k exposure of a product is the objective
of a reverse top-k query, introduced recently by
Vlachou et al. [9]. A traditional (linear) top-k query is a
weight vector $(w_1, w_2)$ that assigns a weight to each
attribute of the relation D. The result set contains the k
tuples $(a_1, a_2) \in D$ for which $w_1 a_1 + w_2 a_2$ is
highest. A reverse top-k query, given as input a numerical
dataset D, a value k, and a new query tuple q, reports the set
of traditional top-k queries on $D \cup \{q\}$ for which q is
in the result set.

In this paper, we focus on two dimensions and on the
version of the problem in which the infinitely many possible
traditional top-k queries are considered.¹ Consequently,
the result of a reverse top-k query here is an infinite set
of weight vectors, which throughout this paper we assume
to be represented as a set of disjoint angular intervals. For
example, the angular interval $(\pi/6, \pi/4)$ describes the
infinitely many weight vectors with an angular distance from
the positive x-axis between $\pi/6$ and $\pi/4$, exclusive.

A Broader Perspective 

The processing of a reverse top-k query is in itself
interesting, but it is also important to consider where it fits
within the context of a broader workflow. That is to say, what
prompts the query and what occurs after the query is executed
affect how the query should be processed. This consideration
of broader context motivates our work.

A single reverse top-k query executed on its own is
informative but not very actionable. A more likely scenario
is that an analyst is trying to compare the impact of many
product options in order to gauge which might be the most
successful among them. In the case of the game development
scenario, it is more useful for the engineering team to
evaluate many trade-offs between AI and graphics in order
to compare what degrees of relative prioritisation will make
the game stand out to the broadest range of end-users.

¹ This version of the problem was termed a monochromatic
reverse top-k query by Vlachou et al. The alternative is the
bichromatic version, in which the traditional top-k queries
to be considered are limited to those enumerated in a finite
relation.

With this in mind, we propose the first indexing-based
solution to reverse top-k queries. The computational
advantage of this approach is that the majority of the cost can
be absorbed before the queries arrive. In contrast, the
existing techniques (of Vlachou et al. [9] and of Wang et al. [11])
inherently depend on knowledge of the current query, so the
linear-cost computation must be restarted for each of the
many queries in a batch.

It is also crucial to consider what becomes of the output
of a reverse top-k query. If it is meant for direct human
consumption, then the end-user can only interpret a
succinct representation. Fragmenting output intervals into
many sub-intervals would overwhelm the user. The argument
we make here is that not every correct output is equivalent.
In particular, the (sometimes severe and unsorted)
fragmentation of output intervals by existing techniques is
quite undesirable.

To address the importance of output, we define maximal
reverse top-k (maxRTOP) queries, in which adjacent output
intervals of a reverse top-k query must be merged. In this
sense, each reported interval in the solution is maximal. A
nice property of our index-based approach is that it
naturally produces this higher-quality, maximal output.

Our Approach 

Our approach to the maxRTOP problem is to absorb up front
the cost of sorting an approximation of, and conducting a
plane sweep on, the k-skyband of D (a subset of D that we
describe in Definition 6) in order to design an index with
query cost guaranteed to be logarithmic in the size of the
true k-skyband.

We achieve this through novel geometric insight into the
problem. Conceptually, we transform each tuple of D into
a line in Euclidean space, constructing an arrangement of
lines (i.e., a set of vertices, edges, and faces based on their
intersections; see Definition 5). From that arrangement, we
show that a critical star-shaped polygon can be extracted
for each value of k. The importance of this polygon is that
if we apply the same transform to the new query tuple q to
produce a line $l_q$, then the maxRTOP response is exactly
the intersection of $l_q$ with the interior of this polygon. So,
with this insight, the challenge becomes to effectively index
the polygon.

A crucial observation that we derive is that the polygon
has a particular form: in a convex approximation, the
endpoints of any edge must be within O(k) edges of each other
in the original polygon. Leveraging this insight permits us to
produce an index with guaranteed O(log n) query cost.

Computationally, the construction and representation of
an arrangement of lines is somewhat expensive. Instead, we
demonstrate that the only tuples that could form part of
the critical polygon are those among the k-skyband of D.
So, we approximate the k-skyband, sort these lines based
on their x-intercept (the dominating cost of the algorithm),
and introduce a radial plane sweep algorithm to build the
polygon index.

Our query algorithm is a binary search on the convexified
polygon. The recursion is based on the slope of the
query line compared to the convex hull at the recursion
point. Once we discover the at most two intersections of
the query line with the convex hull, we perform at most two
O(k) sequential scans to derive the exact solution. We can
efficiently process a batch of queries, because we need only
intersect the transformed line for each query with the same
star-shaped polygon using the same index structure.

Summary of Our Contributions 

To the MRTOP problem, we make several substantial con- 
tributions: 

• The definition of maximal reverse top-k (maxRTOP)
queries, which accounts for the neglected post-processing
that invariably must be done on the result set and
that illustrates a major weakness in techniques that
produce highly fragmented and uninterpretable result
sets.

• The first index-based approach to reverse top-k queries,
leveraging our thorough geometric analysis of the problem
and our resultant novel insight into the problem
properties. This approach is the first to produce
logarithmic query cost for reverse top-k queries.

• The creation, optimisation, and publishing of compa- 
rable implementations for the work of Vlachou et al., 
Wang et al., and us, and an experimental evaluation 
that compares each of the three algorithms for different 
data distributions. 

2. LITERATURE 

Monochromatic reverse top-k queries are quite new,
introduced by Vlachou et al. [9], and are an example of the
growing field of Reverse Data Management [6]. As yet, there
are two algorithms (besides ours) to efficiently answer
monochromatic reverse top-k queries: the one originally
proposed by Vlachou et al. [9], and a subsequent algorithm
proposed by Wang et al. [11]. Both are linear-cost,
two-dimensional algorithms.

The algorithm of Vlachou et al., refined in their more
recent work [10], interprets each tuple as a point in Euclidean
space and relies on the Pareto-dominance relationships
between the query point q and the points in D. In particular, it
groups the points of D into those that dominate q, are
dominated by q, and are incomparable to q. The second phase
is to execute a radial plane sweep over the set of points that
are incomparable to q in order to derive the exact solution.
Given the nature of the algorithm, we believe an interesting
research direction may be to incorporate the work of
Zou and Chen [12] on the Pareto-Based Dominant Graph.

Das et al. [3] describe a duality transform approach for
traditional top-k queries. Wang et al. adapt this work into an
algorithm that maintains a list of segments of $l_q$ as follows.
First they transform the query into a dual line $l_q$. Then, for
each tuple p in D, they construct the dual line $l_p$ and split
$l_q$ at its intersection point with $l_p$. For each segment of
$l_q$, they maintain how many of the tuples in D so far have a
higher rank than q over that segment, discarding the segment as
soon as the count exceeds $k - 1$. Their work reports an
experimental order-of-magnitude improvement over that of
Vlachou et al., a claim that our experiments independently
verify.

The foremost distinctions (other than techniques) of our
work from these ([9], [11]) are, first, that the majority of our
computation is independent of q (query-agnostic), and,
second, the recognition that the queries are more plausibly
executed in batches. Together, these distinctions legitimise
the construction of an asymptotically faster index-based
approach. We also note that Wang et al. propose a cubic-space
rudimentary index that materialises the solution to every
query, but which cannot handle the case when $q \notin D$.

Our approach in this paper, enabled by our earlier
research on threshold queries [2], is based on arrangements,
a central concept in computational geometry. We suggest
the 1995 survey by Sharir [8] for the interested reader. It is
particularly relevant because of its discussion of the
computation of zones in an arrangement (i.e., the set of cells
intersected by a surface). The de facto standard for representing
arrangements is the doubly connected edge list, which is
detailed quite well in the introductory text of de Berg et al. [1].

Our analysis of the arrangement is centred around data
depth and depth contours. Within Statistics, data depth is a
well studied approach to generalising concepts like mean to
higher dimensions, and a number of different depth measures
were recently evaluated against each other by Hugg et al. [4].
Top-k rank depth has not been studied, but is similar to
arrangement depth, which is investigated by Rousseeuw and
Hubert [7], particularly with regard to bounding and
algorithmically computing the maximum depth of a point within
an arrangement. It is important to note, however, that we
deviate from these other concepts of data depth by setting
the face containing the origin, rather than the external face,
to have minimal depth and by not ensuring affine
equivariance. As a consequence, we cannot make the assertion
about connectedness and monotonicity offered by the study
of depth contours by Zuo and Serfling [13].

A last comment about related work pertains to literature
on the traditional top-k query problem, surveyed by
Ilyas et al. [5]. Results in that domain cannot be
straightforwardly applied here, as argued by Vlachou et al., because
non-null solutions to a monochromatic reverse top-k query
are infinite sets.

3. PRELIMINARIES 

In this section we formally introduce the problem under 
study and define the scaffolding upon which this work relies. 

Throughout this work, we assume queries are executed
on a two-dimensional, numeric relation D, which is a set of
tuples $(a_1 \in \mathbb{R}, a_2 \in \mathbb{R})$. Tuples can
alternatively be conceived as points $(a_1, a_2)$ in the
Euclidean plane or as two-dimensional vectors $(a_1, a_2)$. We
assume |D| is "large", and that $k \in \mathbb{Z}^+$ with
$k \ll |D|$.

To begin, a traditional, linear top-k query is a pair of
weights $(w_1, w_2)$. The response is the set of k tuples in D
which, when interpreted as vectors, have the largest dot
product with $(w_1, w_2)$. That is to say:

Definition 1. The response to a traditional, linear top-k
query, $w = (w_1, w_2)$, is the set:

$$\mathrm{TOP}(w) = \{v \in D : |\{u \in D : u \cdot w > v \cdot w\}| < k\}.$$
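
As a concrete reading of Definition 1, the following Python sketch evaluates TOP(w) by brute force. The function name `top_k` and the sample data are illustrative assumptions, not part of our implementation. Note that, exactly as in the definition, ties in score can yield more than k tuples.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def top_k(data: List[Point], w: Point, k: int) -> List[Point]:
    """Return the tuples that fewer than k other tuples strictly outscore."""
    def score(v: Point) -> float:
        # Dot product of the tuple with the weight vector.
        return v[0] * w[0] + v[1] * w[1]

    result = []
    for v in data:
        better = sum(1 for u in data if score(u) > score(v))
        if better < k:
            result.append(v)
    return result


# Hypothetical sample relation.
data = [(1.0, 9.0), (5.0, 5.0), (9.0, 1.0), (2.0, 2.0)]
print(top_k(data, (2.0, 1.0), 2))  # -> [(5.0, 5.0), (9.0, 1.0)]
```

This quadratic-time version is meant only to pin down the semantics; sorting by score would be the usual way to evaluate a single query.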

The monochromatic reverse top-k query, introduced by
Vlachou et al. [9], which we refer to simply as a reverse
top-k query in this paper, is a tuple $q = (q_1, q_2)$ not
necessarily in D. The response is the set of traditional, linear
top-k queries on $D \cup \{q\}$ for which q is in the result set.
Formally:

Definition 2. The response to a reverse top-k query,
$q = (q_1, q_2)$, is the set of angles

$$\mathrm{RTOP}(q) = \{\theta \in [0, \pi/2] : |\{v \in D : v_1 + v_2 \tan\theta > q_1 + q_2 \tan\theta\}| < k\}.$$
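
To make Definition 2 tangible, the sketch below tests whether a single angle belongs to RTOP(q). The name `in_rtop` is an assumption for illustration. The code scores with (cos θ, sin θ) rather than tan θ; for θ in [0, π/2) this is the same comparison scaled by cos θ > 0, and it avoids the singularity of tan at π/2.

```python
import math


def in_rtop(data, q, theta, k):
    """True if fewer than k tuples of data strictly outscore q at angle theta."""
    w = (math.cos(theta), math.sin(theta))
    score_q = q[0] * w[0] + q[1] * w[1]
    better = sum(1 for v in data if v[0] * w[0] + v[1] * w[1] > score_q)
    return better < k


# Hypothetical sample relation and query tuple.
data = [(1, 9), (5, 5), (9, 1)]
print(in_rtop(data, (6, 6), math.pi / 4, 1))  # -> True
print(in_rtop(data, (6, 6), 0.0, 1))          # -> False
```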



We now introduce a more user-conscious problem
definition, that of a maximal reverse top-k query. The response
to $q = (q_1, q_2)$ is the set of largest angular ranges for which
every angle within the range is in the result of a reverse
top-k query, q. Formally:

Definition 3. The response to a maximal reverse top-k
query, $q = (q_1, q_2)$, is the set of open intervals:

$$\mathrm{maxRTOP}(q) = \{(\theta_0 \geq 0, \theta_1 \leq \pi/2) : \theta_0 \notin \mathrm{RTOP}(q) \wedge \theta_1 \notin \mathrm{RTOP}(q) \wedge \forall \theta \in (\theta_0, \theta_1),\ \theta \in \mathrm{RTOP}(q)\}.$$
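
For intuition about Definition 3, here is a naive reference computation, a sketch that is not the index-based method of this paper. It relies on the fact that the rank of q can change only at an angle where some tuple ties with q, tests one angle between each pair of consecutive tie angles, and merges adjacent qualifying ranges. The function name is an assumption, and intervals are simply clipped at 0 and π/2 without checking the endpoint conditions of the definition.

```python
import math


def max_rtop_bruteforce(data, q, k):
    """Maximal angular intervals in which fewer than k tuples outscore q."""
    half_pi = math.pi / 2
    critical = set()
    for v in data:
        dx, dy = v[0] - q[0], v[1] - q[1]
        if dx * dy < 0:  # v and q are incomparable: they tie at one angle
            critical.add(math.atan2(abs(dx), abs(dy)))
    bounds = [0.0] + sorted(critical) + [half_pi]

    intervals = []
    for lo, hi in zip(bounds, bounds[1:]):
        mid = (lo + hi) / 2  # the ordering is fixed strictly inside (lo, hi)
        w = (math.cos(mid), math.sin(mid))
        score_q = q[0] * w[0] + q[1] * w[1]
        better = sum(1 for v in data if v[0] * w[0] + v[1] * w[1] > score_q)
        if better < k:
            if intervals and intervals[-1][1] == lo:
                intervals[-1] = (intervals[-1][0], hi)  # merge adjacent ranges
            else:
                intervals.append((lo, hi))
    return intervals


# Hypothetical sample relation and query tuple.
data = [(1, 9), (5, 5), (9, 1)]
print(max_rtop_bruteforce(data, (6, 6), 1))
```

Each call costs quadratic time in |D|, which is precisely the kind of per-query cost that a query-agnostic index is meant to avoid.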

In addition to these problem definitions, we define here
a number of concepts with which we assume the reader is
familiar in the subsequent sections. Specifically, we define
the nullspace of a vector, an arrangement of lines, the
k-skyband of a set of points, and our novel concepts of top-k
rank depth and top-k rank depth contours.

Definition 4. The nullspace of a vector $v = (v_1, v_2)$ is
the set of vectors orthogonal to v: $\{u : u \cdot v = 0\}$. In
two dimensions, this is exactly the line $y = -\frac{v_1}{v_2} x$.
The translated nullspace of v, given a positive real r, is the set
of vectors $\{u : u \cdot v = r\}$, or the line
$y = \frac{r}{v_2} - \frac{v_1}{v_2} x$.

Definition 5. An arrangement of a set of lines C,
denoted $A_C$, is a partitioning of $\mathbb{R}^2$ into cells,
edges, and vertices. Each cell is a connected component of
$\mathbb{R}^2 \setminus C$. Each vertex is an intersection point
of some two lines $l_1, l_2 \in C$. An edge is a line segment
between two vertices of $A_C$.

Definition 6. Consider the set $\bar{S}_k$ of tuples
$(a_1, a_2)$ in D for which there are at least k other tuples with
higher values of both $a_1$ and $a_2$ (i.e., the set of points
Pareto-dominated by at least k other points). The k-skyband of D,
which we denote $S_k$, is precisely the rest of D:
$D \setminus \bar{S}_k$.
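
Definition 6 translates directly into a quadratic-time filter. The sketch below (the name `k_skyband` is an assumption) keeps each tuple that is strictly dominated in both attributes by fewer than k other tuples; it illustrates the definition only and is not the approximation scheme we use to build the index.

```python
def k_skyband(data, k):
    """Tuples of data that fewer than k other tuples dominate in both attributes."""
    def dominates(u, v):
        # Strictly higher in both attributes, as in Definition 6.
        return u[0] > v[0] and u[1] > v[1]

    return [v for v in data
            if sum(1 for u in data if dominates(u, v)) < k]


# Hypothetical sample relation.
data = [(1, 9), (5, 5), (9, 1), (2, 2), (4, 4), (1, 1)]
print(k_skyband(data, 1))  # -> [(1, 9), (5, 5), (9, 1)]
print(k_skyband(data, 2))  # -> [(1, 9), (5, 5), (9, 1), (4, 4)]
```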

Definition 7. The top-k rank depth of a point p within
an arrangement A is the number of edges of A between p
and the origin O. That is to say, the depth of p is the number
of intersections between edges of A and the segment [O, p].
Similarly, the top-k rank depth of a cell of A is the top-k rank
depth of every point within that cell.
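
Because every line of the arrangement is a translated nullspace $\{u : u \cdot v = r\}$ and the origin satisfies $O \cdot v = 0 < r$, the segment [O, p] crosses the line of v exactly when $p \cdot v > r$. The sketch below uses this to compute the depth of Definition 7 without building the arrangement. The name `rank_depth`, the default r = 1, and the convention of not counting a line that passes through p itself are assumptions made for illustration; attribute values are assumed positive.

```python
def rank_depth(data, p, r=1.0):
    """Number of translated nullspaces crossed by the segment from O to p."""
    # The line of v separates O from p exactly when p . v exceeds r.
    return sum(1 for v in data if v[0] * p[0] + v[1] * p[1] > r)


# Hypothetical sample relation.
data = [(1, 9), (5, 5), (9, 1)]
print(rank_depth(data, (0.05, 0.05)))  # -> 0 (the cell containing the origin)
print(rank_depth(data, (0.15, 0.0)))   # -> 1
print(rank_depth(data, (0.15, 0.15)))  # -> 3
```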

Definition 8. A top-k rank depth contour is the set of
edges in an arrangement $A_C$ that have top-k rank depth
exactly k. We also refer to a top-k rank depth contour as the
k-polygon of C, because, as we show later, the contour is a
closed, star-shaped polygon.

4. AN ARRANGEMENT VIEW 

The theme of this paper is to answer maxRTOP queries
with logarithmic cost by means of a data structure featuring
a largely sequential data layout and inspired by geometric
analysis of the problem. In this section, we conduct that
analysis and create the theoretical foundations for the
correctness proof of our access methods in Section 5.

The approach taken in Vlachou et al. is to exploit the
dominance relationship among points in D. The approach
taken in Wang et al. is to compare all points in D to the
query point q in the dual space. We take a very different
approach. We transform the dataset into an arrangement
of lines and demonstrate that embedded in the arrangement
is a critical polygon $V_k$ which partitions $\mathbb{R}^2$ into
points to include among and exclude from a maxRTOP query result.
We show, too, that by applying the same transformation
to the query to produce a line $l_q$, maxRTOP(q) is given
precisely by the intersection of $l_q$ with the interior of $V_k$.

An equally important contribution of this section is that
we derive properties of $V_k$ that are critical for proving later
the asymptotic performance of our access method.

This section is thus divided into three subsections: the
first describes the transformation of D into a set of |D|
contours (Section 4.1); the second derives important properties
of $V_k$ (Section 4.2); and the third establishes the equivalence
of the intersection test to the original maxRTOP problem
(Section 4.3).

4.1 $A_C$ and Top-k Rank Depth Contours

In this section we describe what a top-k rank depth
contour is and how it is constructed from a relation, D. We
illustrate how to construct the arrangement from D and how
to interpret the arrangement as a set of contours. First, in
order to reason about D in terms of an arrangement, we
need to represent each tuple as a line such that the relative
positions of the lines with respect to a ray from the origin
reflect their top-k ranking. This is precisely the property
that is proffered by the translated nullspaces of each tuple,
for any arbitrary positive real r.

So, we convert the set of tuples (or, alternatively, vectors)
D into a set of lines by transforming each tuple $v = (v_1, v_2)$
to the line $v : y = \frac{r}{v_2} - \frac{v_1}{v_2} x$. For a
ray $\vec{r}$ in any direction, we can show that:

Lemma 4.1. If the depth of a point v is less than the depth
of a point u in the direction of a ray $\vec{r}$, then the rank
of v for a traditional, linear top-k query $\vec{r}$ is better
than that of u.

Proof. If the translated nullspace of v is closer to the
origin than that of u in the direction of $\vec{r}$, then
$v \cdot \vec{r} = t = u \cdot c\vec{r}$ for some $c > 1$.
Therefore, $v \cdot \vec{r} > u \cdot \vec{r}$. □
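
The relationship that Lemma 4.1 relies on can be made explicit with a short calculation. The unit direction $w$ of the ray and the distances $\alpha_v$, $\alpha_u$ at which the ray meets the translated nullspaces of v and u are notation introduced here for illustration; we assume $r > 0$, $v \cdot w > 0$, and $u \cdot w > 0$.

```latex
\begin{align*}
  (\alpha_v w) \cdot v = r \;&\Longrightarrow\; \alpha_v = \frac{r}{v \cdot w}, \\
  \alpha_v < \alpha_u \;&\Longleftrightarrow\; \frac{r}{v \cdot w} < \frac{r}{u \cdot w}
  \;\Longleftrightarrow\; v \cdot w > u \cdot w .
\end{align*}
```

In words, a line that is met sooner along the ray belongs to a tuple with a larger score for the query in that direction.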

In fact, we can make a stronger claim: the depth of a
point p is precisely its top-k rank for a query in the
direction of p if p happens to correspond to a point on an
edge of the arrangement.

Corollary 4.2. depth(p) = rank(p) for TOP(p).

Proof. Let depth(p) be d. Then from the definition of
top-k rank depth there are d other baseplanes that will be
encountered sooner by a ray emanating from O in the
direction of p. From Lemma 4.1 we know that each of these
has a better rank than p, so the rank of p is at best d. Also,
from Lemma 4.1 we can conclude that p has a better rank
than all those with translated nullspaces farther from the
origin than that of p, so the rank of p is not greater than d,
either. □



The k'th contour of an arrangement is the set of all edges
at the same depth. We wish to show that, in fact, the edges
form a connected ring around the origin, thus forming a
polygon. In order for this to be true, we need to show that
in any direction there is exactly one point on the contour,
and that the points are all adjacent to each other. This is
the objective of the following three lemmata.

Firstly, to demonstrate connectedness, it is important that
top-k rank depth is a monotone measure:

Lemma 4.3. Top-k rank depth increases monotonically with
Euclidean distance from O in any arbitrary direction.

Proof. Consider two points p, q such that p lies on the
line segment [O, q]. Every line in the arrangement that
crosses [O, p] also crosses [O, q], so
$\mathrm{depth}(q) \geq \mathrm{depth}(p)$. □

Secondly, we need to show that a cell of depth i is unique 
in a given direction: 

Lemma 4.4. There is exactly one cell of depth i in any
given direction from O, for reasonably small i.

Proof. First, we show that there is at most one cell of
depth i. This follows from the definition of top-k rank depth.
Assume for the sake of contradiction that there are two
disjoint cells, A and B, with depth i in the same direction.
Without loss of generality, assume that A is nearer to O
than B. Take some point $a \in A$. Then, from the definition
of top-k rank depth, we know that there are exactly i lines
crossing the line segment [O, a]. Now consider some point
$b \in B$. Because A is nearer than B to O, clearly every line
between a and O also crosses the line segment [O, b]. So,
too, must the upper boundary of A, since A and B are
distinct. But then there are at least i + 1 lines crossing [O, b],
which contradicts that B is at depth i.

The assumption that i is reasonably small is to guarantee
that there are sufficiently many tuples in D that there are at
least i tuples to return for a traditional top-k query. This is
enough to imply that there is an i-contour in every possible
direction, so there must be at least one cell in our given
direction at depth i, as well. □

Thirdly, we can now show that, in fact, all cells of depth
i are connected and can thus form a contour:

Corollary 4.5. All cells at the same top-k rank depth
($< k_{\max}$) are connected.

Proof. This follows from Lemma 4.4, which implies that
there are no discontinuities in the contour in any given
direction. Observe, too, that for any cell there must be an
adjacent cell with the same depth at every corner. The
corners correspond to directions in which the incident
translated nullspaces reverse order. So, since the top translated
nullspace becomes a bottom translated nullspace and vice
versa, the depth does not change.² □

This is enough to establish that the k'th contour of the
arrangement is precisely a star-shaped polygon:

² Strictly speaking, the vertex/corner itself is a discontinuity,
as there is no point in that direction with exactly the right
number of crossing line segments, but this is infinitesimal in
size and we ignore the issue because we return open intervals
anyway.



Theorem 4.6. A contour is a star-shaped polygon.

Proof. First, we know that the contour is connected and
exists in every direction from O. Also, every point inside the
polygon is visible from O, for if there were some point p that
were not visible, then an edge of the boundary would cross
[O, p]. However, this would imply that there are two cells at
the same depth in the direction of p from O, contradicting
Lemma 4.4. □

Theorem 4.6 is quite important. It establishes that we can
represent D as a set of polygons with a unique depth i, each
of which itself encodes the i'th ranked tuple for any possible
traditional, linear top-k query. If there is only one value k of
interest, then the entire dataset can be represented just by
one polygon. In the next subsection, we show properties of
the k-polygon, including bounds on its size, and in the
following subsection describe how to use it in order to address
the main question of this paper, maxRTOP queries.

4.2 Properties of $V_k$

In order to use $V_k$ as a data structure, we have to
establish properties of the polygon that determine
asymptotic performance. As we will detail in the next
section, our data structure will be a representation of $V_k$, so
the number of edges and vertices in the polygon influences
our access time.

Also, to improve performance, our data structure includes
a convex approximation of $V_k$ (specifically the convex hull),
and understanding the implications of this approximation is
also important.

Thirdly, we approximate the dataset D by $S_k$, so
understanding the implications of this approximation is clearly
important, as well.

Gathering this understanding is the intent of the next
three lemmata. Specifically, they answer these three questions
in order:

Lemma 4.7. An arrangement of m lines can produce
contours at top-k rank depth i with no more than O(m) edges.

Proof. Note from Theorem 4.10 that for each line l
derived from a tuple v, the edges it contributes to the k'th
contour are precisely the answer to a maxRTOP query of v
on $D \setminus \{v\}$. From Proposition 4.11 we know this can
consist of at most two disjoint angular intervals; therefore, l can
contribute at most two edges to the k'th contour. □

Lemma 4.8. A concave region between vertices of the
convex hull of the k'th contour's upper boundary can have at
most $2k - 1$ vertices.

Proof. Notice that vertices of the convex hull of the
contour's upper boundary are themselves at depth $k - 1$.
Consider two such vertices, $v_i, v_j$, delimiting a concave
region. Any line that passes neither under $v_i$ nor under $v_j$
and is orthogonal to some non-zero vector from O cannot
pass through the concave region's face, so the face is
defined by at most 2k lines. This is, in fact, an arrangement,
so Lemma 4.7 implies the bound on the number of cells in
that arrangement that could possibly be at depth k and thus
contribute a vertex to the concave region's boundary. □

Lemma 4.9. Only tuples in $S_k$ can form part of the
k-polygon.

Proof. Tuples that are not among $S_k$ are, by definition,
among $\bar{S}_k$. However, the tuples of $\bar{S}_k$ are those
dominated by at least k other tuples. In other words, regardless
of the traditional, linear top-k query issued, there are at least k
better ranked tuples. Consequently, the k-polygon, which
encodes the k'th ranked tuples for all possible traditional,
linear top-k queries, clearly does not contain the tuples of
$\bar{S}_k$ in any direction. □

4.3 A Transformed maxRTOP Query 

In the previous subsections we have demonstrated that
a star-shaped polygon (the k-polygon) can encode the k'th
best ranked tuple for all query directions. In this section,
we demonstrate how to use the k-polygon for maxRTOP
queries.

First, recall that the arrangement of lines was produced
by transforming each tuple in D to its translated nullspace,
given some fixed but arbitrary r. Here, we prove that
applying the same transformation to a query q to produce a
line $l_q$ and intersecting $l_q$ with the interior of $V_k$ yields
the directions in which q is among the result set of traditional,
linear top-k queries:

Theorem 4.10. The response to a maxRTOP query, given
query vector $q = (q_1, q_2)$, is the component of
$q : y = \frac{r}{q_2} - \frac{q_1}{q_2} x$ which intersects the
interior of $V_k$.

Proof. Recall from Corollary 4.2 that the k'th contour
corresponds exactly to the vectors of rank k, and from
Lemma 4.3 that the contours increase in rank monotonically.
Therefore, if we constructed a new arrangement which also
contained q, the components of q which lay outside the k'th
contour would be directions in which the rank of q is greater
than k. The complement of this is the solution to the maxRTOP
query. □

Consequently, it suffices to develop algorithms for solving
the problem of identifying the segments of q which lie
inside the k'th top-k rank depth contour in order to solve the
maxRTOP problem. An illustration of this is provided in
Figure 1.

A final note regarding the properties of $V_k$ is that:

Proposition 4.11. The result in two dimensions of a
maxRTOP query consists of at most two continuous intervals.

5. EFFICIENTLY ANSWERING maxRTOP 
QUERIES 

Having established the theoretical foundations in the
previous section, we present here our index structure and
access method. A key insight that we derived earlier is that
the maxRTOP response to q is the intersection of $l_q$ with
the interior of $V_k$. Fittingly, then, our index structure is a
representation of $V_k$ and our access method is an efficient
means of retrieving from the index the intersection points
of $l_q$ with $V_k$. First we give a high-level overview of our
algorithms and data structure and then present the precise
details in the upcoming subsections.

Not just any representation of $V_k$ will suffice: it has to
facilitate the efficiency of the access method. We accomplish
this by creating a binary search procedure to identify the
intersections of $l_q$ with the convex hull of $V_k$. This leads to
an efficient access method because of the bound established in
Lemma 4.8. We also aim to achieve a very sequential data layout
to improve read times. So, we have developed a data structure
consisting of one ordered list of the vertices of the convex
hull of $V_k$ and one ordered list of ordered lists of $V_k$ vertices
not on the convex hull. We describe the index structure in
Section 5.1.

Figure 1: An arrangement labelled with top-k rank depth; the 2nd contour, zoomed in, with its convex
hull displayed by the dashed line; and the (coloured gray) result for the reverse top-2 query for the vector
v = (5, 5/2) (whose baseplane is $y = 2(r - 5x)/5$).

Algorithmically speaking, there are two considerations.
Of foremost importance is how to efficiently query the index
structure, given $l_q$ (Section 5.3). The second consideration is
how to efficiently construct (Section 5.2) the index structure
described above. Let us begin by addressing the first.

The idea is to exploit properties of the problem. Our
binary search to discover the intersection points of $l_q$ with
a convex polygon is of logarithmic cost. Furthermore, given
the intersection points of $l_q$ with the convex hull of $V_k$, we
can find the exact intersection of $l_q$ with $V_k$ by comparing
it with every edge "shaved off" by that convex hull edge.
By Lemma 4.8 we know there are at most O(k) such edges.
Because of our sequential layout, a direct comparison to each
of these O(k) edges is affordable.
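
The edge-by-edge comparison can be sketched as follows. For brevity, this version scans the entire boundary of the k-polygon linearly instead of first locating the relevant convex hull edges by binary search, so it illustrates only the intersection test, not the logarithmic access method. The function name, the default r = 1, and the representation of the polygon as a list of vertices running anti-clockwise from the positive x-axis to the positive y-axis are assumptions.

```python
import math


def query_chain(chain, q, r=1.0):
    """Angular intervals in which the line q . u = r lies inside the polygon."""
    def side(p):
        # Positive when the boundary point p lies beyond the query line.
        return q[0] * p[0] + q[1] * p[1] - r

    crossings = []
    for a, b in zip(chain, chain[1:]):
        sa, sb = side(a), side(b)
        if (sa > 0) != (sb > 0):  # the query line crosses this edge
            t = sa / (sa - sb)
            x = a[0] + t * (b[0] - a[0])
            y = a[1] + t * (b[1] - a[1])
            crossings.append(math.atan2(y, x))

    # Crossings alternate between entering and leaving the polygon.
    bounds = ([0.0] if side(chain[0]) > 0 else []) + crossings
    if len(bounds) % 2 == 1:
        bounds.append(math.pi / 2)
    return list(zip(bounds[0::2], bounds[1::2]))


# Hypothetical 1-polygon of the tuples (9, 1) and (1, 9), with r = 1.
chain = [(1 / 9, 0.0), (0.1, 0.1), (0.0, 1 / 9)]
print(query_chain(chain, (6, 6)))
```

For the sample polygon, the query tuple (6, 6) yields a single interval whose endpoints are the angles at which it ties with (9, 1) and with (1, 9), respectively.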

Our construction algorithm is a plane sweep algorithm.
We sweep radially from the positive x-axis to the positive
y-axis, maintaining a list of all the lines in sorted order with
respect to their intersection points on the sweep line. At any
given moment during the plane sweep, the k'th line in the list
is the edge of the k-polygon. So, identifying the k-polygon
is equivalent to identifying all the points at which the k'th
line in that list changes. These points are the vertices of
the k-polygon. Maintaining the convex hull of the polygon
is fairly straightforward if one maintains convexity as an
invariant throughout the sweep.
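
The observation that the k-polygon changes edge exactly where the k'th line changes can be checked with a deliberately naive stand-in for the sweep. Rather than maintaining the sorted list incrementally, the sketch below re-sorts the tuples once between every pair of consecutive angles at which two tuples tie, which is far more expensive than the plane sweep but easy to verify. The function name and output format are assumptions, and k is assumed to be at most the number of tuples.

```python
import math
from itertools import combinations


def k_polygon_edges(data, k):
    """List of (start angle, end angle, tuple) edges of the k-polygon."""
    half_pi = math.pi / 2
    critical = set()
    for u, v in combinations(data, 2):
        dx, dy = u[0] - v[0], u[1] - v[1]
        if dx * dy < 0:  # u and v swap order at exactly one angle
            critical.add(math.atan2(abs(dx), abs(dy)))
    bounds = [0.0] + sorted(critical) + [half_pi]

    edges = []
    for lo, hi in zip(bounds, bounds[1:]):
        mid = (lo + hi) / 2  # the ordering is fixed strictly inside (lo, hi)
        w = (math.cos(mid), math.sin(mid))
        ranked = sorted(data, key=lambda v: v[0] * w[0] + v[1] * w[1],
                        reverse=True)
        kth = ranked[k - 1]
        if edges and edges[-1][2] == kth:
            edges[-1] = (edges[-1][0], hi, kth)  # same line: extend the edge
        else:
            edges.append((lo, hi, kth))
    return edges


# Hypothetical sample relation.
data = [(1, 9), (6, 6), (9, 1)]
print([e[2] for e in k_polygon_edges(data, 2)])
# -> [(6, 6), (9, 1), (1, 9), (6, 6)]
```

In this example the tuple (6, 6) contributes two separate edges to the 2-polygon, which is the maximum that the proof of Lemma 4.7 allows for a single line.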

The expense of this algorithm is dominated by initially
sorting all the lines with respect to their intersection points
with the x-axis. We improve upon this by recognising that
only tuples of the k-skyband are relevant. So, at the cost
of an extra sequential scan, we approximate the k-skyband
with perfect recall (i.e., ensure every true positive is in the
approximation) and then construct $V_k$ from that
approximation, rather than from all of D.

The approximation method exploits the work we have
already done in this paper. We note that if a tuple is in the
k-skyband of D, then it must be in the k-skyband of any
subset of D that contains it. So, we build our index structure
on 2k selected tuples from D and then include in our
approximation any tuples which have non-null maxRTOP query
responses on that small index structure.

Together, these algorithms and this data structure give
Theorem 5.1.

Theorem 5.1. The two dimensional maxRTOP problem
can be solved using O(log n + k) query time with an index
that requires O(n) disk space.

Under the practical assumption that k is constant or O(log n),
the above theorem implies that query cost is O(log n).

5.1 The k-Polygon Index Structure

Facilitating logarithmic query time of the index largely
depends on how the data is represented. Our idea is to
exploit Theorem 4.8 in our representation. Let H denote
the set of vertices of the convex hull of a k-polygon, Vk. We
maintain two arrays, which we collectively refer to as the
dual-array representation of Vk. The first, which we call the
convex hull array, contains the |H| vertices of H, ordered
anti-clockwise from the positive x-axis. The second array,
which we call the concavity array, is of size |H| − 1. The i'th
entry contains a sequential list of the up to 2k − 1 vertices
of the k-polygon between the i'th and (i + 1)'st vertices of
H.
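
A minimal sketch of the dual-array representation follows. The
class name DualArrayIndex and its methods are hypothetical, and we
assume here that each concavity list excludes the two hull vertices
that bound it; it is meant only to make the layout and its size
constraints concrete.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class DualArrayIndex:
    """Hypothetical container for the dual-array representation of Vk."""
    # Convex hull array: the |H| hull vertices, anti-clockwise from the x-axis.
    hull: List[Point] = field(default_factory=list)
    # Concavity array: entry i lists the polygon vertices strictly between
    # hull vertices i and i + 1 (assumption: hull vertices are not repeated).
    concavities: List[List[Point]] = field(default_factory=list)

    def is_consistent(self, k: int) -> bool:
        """Check the sizes stated in the text: |H| - 1 lists of <= 2k - 1 vertices."""
        if not self.hull:
            return not self.concavities
        if len(self.concavities) != len(self.hull) - 1:
            return False
        return all(len(c) <= 2 * k - 1 for c in self.concavities)

    def polygon_vertices(self) -> List[Point]:
        """All vertices of the k-polygon in boundary order."""
        out: List[Point] = []
        for i, h in enumerate(self.hull):
            out.append(h)
            if i < len(self.concavities):
                out.extend(self.concavities[i])
        return out

idx = DualArrayIndex(hull=[(2.0, 0.0), (1.5, 1.5), (0.0, 2.0)],
                     concavities=[[(1.6, 0.9)], []])
print(idx.is_consistent(k=1), idx.polygon_vertices())
```

Storing each concavity as one contiguous list is what allows the
O(k) scan after the binary search to read sequentially.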

5.2 Construction of the k-Polygon

Although Section 4.2 suggests how to determine the
k-polygon of D by first constructing an arrangement of lines
and then extracting from it all the edges at a top-k rank
depth of k, here we describe a much more efficient algorithm.
The key insight is that the only tuples that could form part
of Vk are those among the k-skyband of D. So, we
approximate the k-skyband with perfect recall, sort those
lines based on their x-intercept (the dominating cost of the
algorithm), and introduce a radial plane sweep algorithm to
build the polygon index.

k-Skyband Approximation

The important consideration in our k-skyband approximation
is that perfect recall is critical. Otherwise, we may
miss a line that forms part of the k-contour. We exploit
the insight that the k best lines with respect to each axis
form a contour relatively close to the real contour, and that
if a tuple is in the k-skyband, it clearly must be in the
k-skyband of any subset of the data that contains it. Thus, the
approximation algorithm proceeds by quickly determining the
≤ 2k lines as above, constructing a contour from them, and
determining which lines in D have non-null maxRTOP query
answers on the approximate contour. See Algorithm 1.



Algorithm 1 Approximating the k-skyband of D
1: Input: D; k
2: Output: S ⊆ D, the tuples that form the k-skyband of D,
   plus potentially some false positives
3: Initialise S, an empty set of tuples
4: Let X denote the k tuples in D with the highest values
   for attribute x
5: Let Y denote the k tuples in D with the highest values
   for attribute y
6: Construct V_{X∪Y}, the k-polygon index on the set X ∪ Y,
   using Algorithm 2
7: for all p ∈ D do
8:   if l_p intersects the interior of V_{X∪Y} or p ∈ X ∪ Y then
9:     Add p to S
10:   end if
11: end for
12: Free X and Y.
13: RETURN S.
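
The subset property that Algorithm 1 relies on can be checked
with a quadratic-time reference sketch. We assume the usual
definition of the k-skyband (tuples dominated by fewer than k other
tuples) with larger attribute values being better; the function
names and sample data are hypothetical, and this is not the
approximation algorithm itself.

```python
def dominates(p, q):
    """True if p is no worse than q on both attributes and better on one."""
    return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])

def k_skyband(data, k):
    """Brute-force k-skyband: tuples dominated by fewer than k others."""
    return [p for p in data if sum(dominates(o, p) for o in data) < k]

# Hypothetical two-attribute data and a subset of it.
data = [(0.9, 0.1), (0.8, 0.8), (0.5, 0.6), (0.4, 0.5), (0.1, 0.9)]
subset = [(0.8, 0.8), (0.5, 0.6), (0.4, 0.5)]

k = 2
full = k_skyband(data, k)
small = k_skyband(subset, k)
# Every skyband tuple of the full data that lies in the subset
# is also a skyband tuple of the subset.
print(all(p in small for p in full if p in subset))
```

Removing tuples can only reduce the number of dominators of a
remaining tuple, which is why no true positive is lost.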



Radial Plane Sweep 

We construct a contour from a set of lines using a radial
plane sweep. The idea is to traverse the set of intersection
points in angular order, maintaining a sorted list of the lines.
In this way, we build the contour incrementally from the
positive x-axis towards the positive y-axis. Traversing in
this order also allows us to maintain convexity of the contour
as we go. Like most plane sweeps, a primary advantage is
that we need only look at intersection points between two
lines after they become neighbours. If this does not occur
between the sweep line and the positive y-axis, then we need
not consider the intersection point at all. Algorithm 2 offers
the details of the sweep algorithm.

5.3 Querying the k-Polygon Index

Here we present how to query our k-polygon index to
determine the segments of a line l_q that are strictly contained
within the interior of the k-polygon, Vk. The algorithm
(Algorithm 3) is a binary search on the convex hull of the
polygon, followed by a sequential scan of O(k) edges of
Vk. The recursion is based on the slope of l_q compared to
the convex hull of Vk at the recursion point.
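
The sequential scan step can be sketched as follows, assuming the
query line is given in the form ax + by = r and that one concavity
is stored as an ordered list of vertices. The helper names are
hypothetical, and the degenerate case of an edge lying on the line
is ignored under the general position assumption.

```python
def side(a, b, r, p):
    """Signed value telling on which side of a*x + b*y = r the point p lies."""
    return a * p[0] + b * p[1] - r

def edge_intersection(a, b, r, p, q):
    """Constant-time test of the line against the edge from p to q.

    Returns the crossing point, p itself if p lies on the line, or None.
    The endpoint q is left to the next edge so shared vertices count once.
    """
    sp, sq = side(a, b, r, p), side(a, b, r, q)
    if sp == 0:
        return p
    if sp * sq < 0:
        t = sp / (sp - sq)  # fraction of the way from p to q
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
    return None

def scan_chain(a, b, r, vertices):
    """Scan consecutive edges of one vertex chain; cost is linear in its length."""
    hits = []
    for p, q in zip(vertices, vertices[1:]):
        hit = edge_intersection(a, b, r, p, q)
        if hit is not None:
            hits.append(hit)
    if vertices and side(a, b, r, vertices[-1]) == 0:
        hits.append(vertices[-1])  # last vertex is not the start of any edge
    return hits

print(scan_chain(1, 1, 1, [(0, 0), (2, 0), (2, 2)]))
```

Because each chain holds O(k) vertices, this scan adds O(k) work
per intersection point found by the binary search.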

5.4 Asymptotic Performance 



Algorithm 2 Building Vk from a k-skyband approximation
1: Input: L, an array of lines sorted by ascending
   x-intercept; k
2: Output: A dual-array representation of Vk
3: Initialise an empty array H for convex hull vertices
4: Initialise an empty array of lists C for concavities
5: Initialise I as a priority queue containing the |L| − 1
   intersections of neighbouring lines in L, sorted by angle
   from the positive x-axis, discarding those < 0.
6: while I is not empty do
7:   Pop next intersection i ∈ I
8:   Let l_left and l_right be the lines intersecting at i.
9:   if l_left = L_{k−1} or l_right = L_{k−1} then
10:     Add i to H
11:     if ∃h ∈ H : slope([h, i]) < slope([h, h + 1]) then
12:       Add to C_h all vertices between h and i.
13:       Remove all vertices between h and i from H and
          from C_j, ∀j ≠ h.
14:     end if
15:   end if
16:   Swap l_left and l_right in L
17:   Add to I the intersection of l_left with its new
      neighbouring line and the intersection of l_right with its new
      neighbouring line, provided they are at angles greater
      than that of i and in the positive quadrant
18: end while
19: Free I.
20: RETURN H and C.



Earlier we stated the asymptotic performance of our
algorithms. We now have the tools to prove that theorem.
The basic idea is that a line can intersect the boundary of a
convex shape in at most two locations and, for each of those
intersection points, the cost of a face traversal is bounded.

Proof of Theorem 5.1. First, note that a line can
intersect the boundary of a convex polygon in at most two
points, so the binary search tree traversal need follow at
most two paths. Recall from Lemma 4.7 that each contour
contains at most n cells, and thus the convex hull contains
at most n − 1 edges. From Lemma 5.2, the binary search
requires O(log n) time. For each of the two intersection points
found, we traverse the corresponding face sequentially. From
Lemma 4.8, each of these faces contains O(k) edges, and
we know that finding the intersection (or, equivalently,
ascertaining the non-intersection) of two two-dimensional line
segments requires constant time.

Since the binary search runs independently of the face
traversals, the total query cost is O(log n + k). Because we
assume k is O(log n), the entire query procedure is O(log n).

Regarding the space requirements, Lemma 4.7 implies that
the polygon itself can contain at most O(n) vertices. Because
each vertex could appear at most twice in the data structure
(once on the convex hull and once in a single concavity),
and because the data structure is, simply, the vertices of the
k-polygon, the disk space required by the data structure is
O(n). □

Lemma 5.2. The intersection of the query line with the
convex hull can be determined in O(log n) time.

Proof. The intersection algorithm proceeds by binary
search. First, find the middle vertex v_{n/2} and determine



Algorithm 3 Querying a dual-array k-polygon, Vk
1: Input: Dual-array representation of Vk, line l_q,
   start/end indexes.
2: Output: Intersection points of l_q with Vk
3: if end − start = 2 then
4:   Traverse the O(k) list in the concavity array at
     position start, returning any intersections with l_q.
5:   RETURN.
6: end if
7: Compute midpoint vertex of H at (end − start)/2 + start.
8: if l_q passes above midpoint then
9:   if slope of l_q is less than slope of [midpoint−1,
     midpoint] then
10:     Recurse on lower half with end=midpoint
11:   else if slope of l_q is greater than slope of [midpoint,
      midpoint+1] then
12:     Recurse on upper half with start=midpoint
13:   end if
14: else
15:   if l_q passes above vertex at position start then
16:     Recurse on lower half with end=midpoint
17:   end if
18:   if l_q passes above vertex at position end then
19:     Recurse on upper half with start=midpoint
20:   end if
21: end if



whether the query line passes above or below it. If above,
then recurse left if the query line has shallower slope than
edge (v_{n/2}, v_{n/2+1}). Recurse right if the query line has steeper
slope than edge (v_{n/2−1}, v_{n/2}). Because edge (v_{n/2}, v_{n/2+1})
is shallower than edge (v_{n/2−1}, v_{n/2}), at most one recursion
direction can be followed.

If, instead, the query line passes below v_{n/2}, then it is
inside the contour (if in the correct quadrant at all). To find
the intersection points, recurse left if the query line passes
above v_{n−1}. Recurse right if the query line passes above
v_0. It is possible that both conditions are true, but this can
only occur once, because the truth of the condition implies
an intersection point and a straight line has at most two
intersection points with a convex polygon. Therefore, the
binary search follows at most two distinct paths. □

6. EXPERIMENTAL EVALUATION 

Until now, the focus of this paper has been on proving the
correctness and asymptotic performance of our approach to
indexing for monochromatic reverse top-k queries. Here,
we pursue a different direction, examining the question of
performance in more detail through experimentation. In
particular, we seek to address two questions. As we showed
earlier, if the size of the k-polygon is |DS| and the size of
its convex hull is |CH|, then the query cost of our index is
O(k + log |CH|). So, a natural question is what a typical
value of |DS| and of |CH| might be. This is also important
because it indicates how much space the data structure will
consume on disk once built. The second question is that of
raw performance: in how much time can the index be built
and, later, be queried? To contextualise these performance
numbers, we compare the performance of our index to that
of repeatedly executing the non-index-based algorithms of
Vlachou et al. and of Wang et al.



6.1 Experimental Setup 

For the experiments, we implemented and optimised the
algorithms of Vlachou et al., of Wang et al., and of this paper
(Chester et al.) in C and compiled our implementations
with the GNU C compiler 4.4.5 using the -O6 flag. Our
implementations of Vlachou et al. and of Wang et al. do not
produce maximal intervals, although, naturally, ours does.
We ignore the cost of outputting the response, because this
is more or less the same for each algorithm. On the other
hand, each interprets the tuples differently, so we do include
in the measurements the cost of reading the input files.

We ran the experiments on a machine with an AMD Athlon
processor with four 3GHz overclocked cores and 8GB RAM,
running Ubuntu. The timings were calculated twice, once
using the Linux time command and once using the GNU
profiler by compiling with the -pg flag.

Datasets. 

We use the regular season basketball player statistics from 
databasebasketball.com and generate five different datasets 
with which to perform our experiments by projecting com- 
binations of two attributes. The attribute combinations are 
chosen to diversify the degree of (anti-)correlation based 
on intuitive reasoning about the attributes. In particular 
we study the following pairs: (Points, Field Goals Made), 
(Defensive Rebounds, Blocks), (Personal Fouls, Free Throw 
Attempts), (Defensive Rebounds, Assists), (Blocks, Three 
Pointers Made). A traditional top-k query on each pair is 
equivalent to asking for the k best player seasons according 
to a given blend of the skills. (Note that for the first pair and 
a query (1,0), for example, Wilt Chamberlain would appear 
several times, once for each of his sufficiently high scoring 
seasons.) 

We reserve the most recent season, 2009, as a set of 578
query points and use the other seasons, 1946-2008, as the
dataset of 21383 tuples. As such, each monochromatic
reverse top-k query is equivalent to asking, "for which blends,
if any, of the given two skills was this particular player's
performance this season ranked in the top-k all-time?" This
contrasts with traditional analysis of basketball data, which
would look only at the axes, to the detriment of well-rounded
players.

In terms of preprocessing on the data, we elect not to
remove multiple tuples for players who played on more than
one team in a given year. We scale the data to the range
(0, 1] by adding 1 to each value and dividing by the largest
value for each attribute (plus one), so that the attributes are
comparable in range. Also, we slightly perturb the data so
as not to violate our general position assumption by adding
10^{-8} to each duplicated scaled value until all values are
unique for each attribute. To construct our index and to
process incoming queries, we assume a value of r = .50.
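
The scaling and perturbation steps can be sketched as below. The
function names are hypothetical, the input is assumed to hold
non-negative raw values, and the order in which duplicates are
visited is an assumption, since the text does not specify it.

```python
def scale_attribute(values):
    """Map raw values into (0, 1] via (v + 1) / (max + 1)."""
    top = max(values) + 1
    return [(v + 1) / top for v in values]

def perturb_duplicates(values, eps=1e-8):
    """Add eps to repeated values until every value is unique.

    Assumption: values are processed in input order and the first
    occurrence of a value is left unchanged.
    """
    seen = set()
    out = []
    for v in values:
        while v in seen:
            v += eps
        seen.add(v)
        out.append(v)
    return out

raw = [0, 9, 9, 4]  # hypothetical raw counts for one attribute
scaled = scale_attribute(raw)
unique = perturb_duplicates(scaled)
print(scaled, unique)
```

Adding 1 before dividing keeps zero-valued statistics strictly
positive, so every tuple maps to a value inside (0, 1] before the
perturbation is applied.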

6.2 Experimental Results 

One intention of these experiments was to illustrate how
construction and query time varied for our algorithm with
respect to k and different attribute combinations. However,
the execution time of our algorithm is nearly constant
across values of k and choices of attributes on the basketball
dataset. In fact, across all experiments the construction
time has an average duration of 34 ms with a standard
deviation of 9.1 ms. The query time has an average duration
of 480 μs with a standard deviation of 22 μs. The total
time for construction and querying averages 35 ms and has
a standard deviation of 8.8 ms. Figures 2-4 illustrate the
total execution times for the three algorithms on three of the
attribute combinations.⁴ We observed that the algorithms
of Vlachou et al. and of Wang et al. are rather sensitive
to the sortedness of the input data, so we report their
performances both for when the data is presorted by y value
and when that presorted file is randomized with the Linux
command sort -R.

3 This choice really is arbitrary within reason. We tried
several values in the range [.25, 1.5] without any effect on the
output.

Figure 2: Execution time for the implemented algorithms on the batch of 578 queries comprised of 2009
basketball statistics, using the statistics from 1946-2008 as a dataset. Defensive Rebounds is regarded as the
x-attribute and Assists is regarded as the y-attribute. This is meant to reflect anticorrelated attributes, but
the data appears to be more correlated.

Figure 3: Execution time for the implemented algorithms on the batch of 578 queries comprised of 2009
basketball statistics, using the statistics from 1946-2008 as a dataset. Blocks is regarded as the x-attribute
and Three Pointers Made is regarded as the y-attribute. This is meant to reflect anticorrelated attributes.



4 We omitted results for the pair (Points, Field Goals Made) 
because it was very similar to the results for the pair (Defen- 
sive Rebounds, Blocks) and the pair (Personal Fouls, Free 
Throw Attempts) because it was very similar to the results 
for the pair (Defensive Rebounds, Assists), just with a larger 
separation between the lines. 



The other primary intention of the experiments was to
evaluate the size of our data structure, particularly since it
has a strong effect on the query time. We show the contours
generated for k = 1, ..., 4 in Figures 5 and 6 for two of the
attribute combinations.⁵ We illustrate in Figures 7 and 8 how
the size of the data structure varies with k on the attribute
pairs (Personal Fouls, Free Throws Attempted) and (Points,
Field Goals Made), respectively. The former pair produced
the largest data structures and the latter, the smallest. The
other three experiments all exhibited very similar behaviour,
with the convex hull remaining relatively constant and the
total size growing linearly with k, and magnitudes between
these examples.

6.3 Discussion 

Overall, our indexing does quite well, with a query cost
slightly less than 1 μs per query, independent of k, typically

5 We omit figures for the other three combinations because 
the contours are too close together to interpret easily. 




Figure 4: Execution time for the implemented algorithms on the batch of 578 queries comprised of 2009
basketball statistics, using the statistics from 1946-2008 as a dataset. Defensive Rebounds is regarded as the
x-attribute and Blocks is regarded as the y-attribute. This is meant to reflect correlated attributes, but the
data appears to be more anticorrelated.




Figure 5: The first four contours derived on the
basketball dataset with personal fouls as the x attribute
and free throws attempted as the y attribute.

Figure 6: The first four contours derived on the
basketball dataset with blocks as the x attribute and
three pointers made as the y attribute.

Figure 7: The size of the contours derived on the
basketball dataset with personal fouls as the x attribute
and free throws attempted as the y attribute,
shown as a function of k.



three to four orders of magnitude faster than the other two 
algorithms. In fact, our algorithm in most cases runs one 
or two orders of magnitude faster, even with the construc- 
tion cost included. This implies that, while the purpose of 
this technique is to support an indexing scenario, the in- 
dex construction is sufficiently fast to render it feasible in 
non-indexing scenarios, too. 

That the query time for the index does not vary much is
not surprising in light of the results of the data structure
size analysis. We see from Figures 7 and 8 that the convex
hull is consistently under forty vertices, and, from Lemma
4.8, we know that |DS| ≤ (2k − 1)|CH|, which explains the
growth of |DS| with respect to k.

Since the query cost of our index is thus O(log 40 + k),
the observed query performance is plausible. The speed of the
construction is more surprising, on the other hand, since its
cost is proportional to the number of intersection points in the
positive quadrant of the dual data lines. This could be
related to the choice of dataset because there could be a strong
stratification of the lines such that they do not intersect in
the positive quadrant given how strongly the statistics are
influenced by playing time. Nonetheless, quick construction
time is not the primary objective of the index.

Figure 8: The size of the contours derived on the
basketball dataset with personal fouls as the x attribute
and free throws attempted as the y attribute,
shown as a function of k.

It is worth noting that there are a few instances in which
the algorithm of Vlachou et al. outperforms our index for
low values of k, especially on sorted data. (This is especially
noticeable in Figure 3, wherein the algorithm accumulates
no time at the granularity of the time command.) This can
be easily explained because as soon as k points are seen in 
the dataset that dominate the query, a null result can be 
reported and the Vlachou et al. algorithm can be halted. 
When k is low, this is substantially more likely. When the 
data is sorted, these dominating points will be among the 
first seen. 

A last observation for discussion is the difference in the
shape of the contours produced by different data
distributions (Figures 5 and 6). The exaggerated slopes in the
former, contrasted against the intricate weaving patterns in the
latter, reflect the degree of anticorrelation in the underlying
data. Insight into the shapes of contours could be grounds for
future work on k-polygon construction algorithms.

7. CONCLUSION AND FUTURE WORK 

In this paper we introduced an index structure to
asymptotically improve query performance for reverse top-k
queries. We take a novel approach to the problem by
representing the dataset as an arrangement of lines and
demonstrating that embedded in the arrangement is a critical
k-polygon which encodes sufficient knowledge to respond to
reverse top-k queries. In particular, we show that by applying
the same transformation to the query tuple to produce a
query line l_q, we can retrieve the response to the reverse
top-k query on q by intersecting l_q with the interior of the
k-polygon.

We derive geometric properties of the problem to bound
the query cost and size of our data structure as O(log n) and
O(n), respectively. We also conduct an experimental
analysis to augment our theoretical analysis and demonstrate
both that our algorithm significantly outperforms approaches
from the literature as the number of queries increases and
that our data structure requires little disk space.



We believe this work can be extended in many directions.
In particular, we feel that our index structure could lead
to improved execution times for bichromatic reverse top-k
queries as well. Also, our geometric analysis of the problem
space offers insight into traditional, linear top-k queries,
and exploring whether some of this research can be applied
in that context is an interesting avenue. Thirdly, the
difficult question of higher dimensions, and especially the
question of how to represent solutions to higher-dimensional
maxRTOP queries, remains open.